Arabic Text Classification Framework Based on Latent Dirichlet Allocation

Authors

  • Mounir Zrigui
  • Rami Ayadi
  • Mourad Mars
  • Mohsen Maraoui
Abstract

Current research usually adopts the Vector Space Model to represent documents in text classification applications. In this model, a document is coded as a vector of words or n-grams. These features do not capture semantic or textual content, which results in a huge feature space and semantic loss. The model proposed in this work instead uses "topics" sampled by a Latent Dirichlet Allocation (LDA) model as text features, which effectively avoids these problems. We extract the significant themes (topics) of all texts, where each theme is described by a particular distribution of descriptors, and then represent each text as a vector over these topics. Experiments are conducted on an in-house corpus of Arabic texts. Precision, recall and F-measure are used to quantify categorization effectiveness. The results show that the proposed LDA-SVM algorithm achieves high effectiveness on the Arabic text classification task (macro-averaged F1 of 88.1% and micro-averaged F1 of 91.4%).
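
The abstract reports both macro-averaged and micro-averaged F1 without saying how they are computed. The sketch below uses the standard definitions and assumes single-label classification: macro-averaging takes the unweighted mean of per-class F1 scores, while micro-averaging pools true positives, false positives and false negatives over all classes before computing one F1. The function names and the toy labels are hypothetical and are not taken from the paper or its corpus.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # the predicted class gains a false positive
            fn[truth] += 1   # the true class gains a false negative
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro-averaged F1, micro-averaged F1) for single-label data."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    # Macro: every class contributes equally, regardless of its size.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts first, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels, chosen only to illustrate the two averages.
y_true = ["sport", "sport", "sport", "sport", "economy", "economy"]
y_pred = ["sport", "sport", "sport", "sport", "economy", "sport"]

macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro-F1 = {macro:.3f}, micro-F1 = {micro:.3f}")
```

In this toy case the small class is classified worse than the large one, so the macro average falls below the micro average. A gap in the same direction, as in the figures reported above, is often read as weaker performance on less frequent categories, although the abstract itself does not state the cause.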

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

An Arabic lemma-based stemmer for latent topic modeling

Developments in Arabic information retrieval did not follow the increasing use of the Arabic Web during the last decade. Semantic indexing in a language with high inflectional morphology, such as Arabic, is not a trivial task and requires a text analysis in the original language. Excepting cross-language retrieval methods or limited studies, the main efforts, for developing semantic analysis me...

Bilingual Chronological Classification of Hafez's Poems

We present a novel task: the chronological classification of Hafez’s poems (ghazals). We compiled a bilingual corpus in digital form, with consistent idiosyncratic properties. We have used Hooman’s labeled ghazals in order to train automatic classifiers to classify the remaining ghazals. Our classification framework uses a Support Vector Machine (SVM) classifier with similarity features based o...

Topic Modeling of Phonetic Latin-Spelled Arabic for the Relative Analysis of Genre-Dependent and Dialect-Dependent Variation

We demonstrate a data collection and analysis system that can be used to analyze the relative contributions of dialect-dependent variation in the lexical of speech-like Arabic text. We utilize Latent Dirichlet Allocation (LDA), a generative probabilistic modeling method, to analyze a phonetic Latin-spelled Arabic online chat corpus. The corpus produces different word choices and word relations ...

Multi-label Classification Algorithm Based on Latent Dirichlet Allocation Model

The Vector Space Model (VSM) is used frequently in Text Classification (TC). However, it usually produces a high-dimensional feature space, which leads to a huge cost of computation and storage. Recently, statistical topic models have come to play an important role in the fields of Information Retrieval (IR), TC and Document Clustering. In this chapter, we try to use a kind of statistical model, Latent Dirichlet Allo...

Journal title:
  • CIT

Volume 20  Issue 

Pages  -

Publication date 2012